SA2 Group 2: Chua Meng Heng (A0184420M), He Lirong (A0188392N), Heng Jie Kai Joven (A0190134N), Lin Huiwen (A0187319U), Wong Mei Shan (A0187790M)
Given the rise in private car ownership in recent years, roads are increasingly congested [1]. According to Statista’s Global Consumer Survey, 86% of Americans drive their own cars to and from work [2]. With so many Americans commuting by car, road safety is increasingly important. The Association for Safe International Road Travel [3] reported that US roadway crashes result in more than 4.4 million serious injuries and 38,000 deaths annually.
The United States was the safest country in the world for road travel in the 1970s. By 2011, it had dropped from number 1 to number 19 in terms of traffic fatalities per registered vehicle and to number 13 (out of 19 countries providing data) in terms of traffic fatalities for the same travel distance [4]. Additionally, with traffic accidents being the leading cause of death in the US for people aged 1 to 54, there is a pressing need to improve the national traffic system to enhance road safety. The economic cost is also significant: a 2018 study put the cost of traffic congestion to the US economy at nearly $87 billion [5].
Unsafe road infrastructure has been named by the World Health Organisation as one of the leading causes of road accidents [6]. This means that the design of roads has a considerable impact on their safety. Roads should be designed with adequate facilities for all users, ranging from pedestrians to drivers. Measures such as traffic cameras, safe crossing paths, and other traffic calming measures are essential in minimising the risk of road accidents.
Reducing risk in the road systems requires informed decision-making, supplemented with the appropriate data, by the government, industry, and other organizations. Upon further analysis, we discovered that the data available on traffic accidents are often raw and it is hard to derive any insights without manipulating the data. Therefore, we decided to tackle this issue using the data wrangling and visualization techniques in R.
Our team worked on developing a data visualisation application targeted at the relevant authorities to pinpoint the problems that the US is facing in its national traffic system. This would allow the relevant authorities to push out much more accurate policies to target identified problems in the system. Therefore, our primary target would be to sell the final product to the government or relevant authorities in order to achieve better road safety.
Additionally, this information may garner interest from businesses that rely on traffic data, such as navigation apps like Google Maps, allowing us to tap into the B2B market as well. These businesses may be interested in the traffic accident data as this analysis can help them improve their services. It allows them to provide location-specific traffic advice, such as warning drivers to drive carefully in areas with high accident rates or directing them to other routes where possible. These improved services could attract more consumers to their applications, as users get an analysis of traffic conditions to better plan their routes.
In this study, we will be using the US accidents dataset (3.0 million records), collected for the United States from Feb 2016 to Dec 2019, which is available on Kaggle (https://www.kaggle.com/sobhanmoosavi/us-accidents).
We further sourced for additional datasets to supplement the primary data. They include:
These datasets were used to derive more meaningful insights for a more multi-dimensional and accurate analysis.
Firstly, we load the libraries needed to clean our data.
library(dplyr)
library(XML)
library(rvest)
library(RCurl)
library(readxl)
library(lubridate)
# tibble provides add_column(), used when adding the Weather_Type column
library(tibble)
Then, we load our data.
usdata <- read.csv("US_Accidents_Dec19.csv")
california <- usdata %>% filter(State == "CA")
The csv file downloaded from Kaggle contains car accident information for 49 states of the United States. The data was over 1GB and would require a lot of computational power to process. Therefore, we decided to narrow our focus to a single state to allow more efficient analysis with a smaller dataset. Specifically, California was chosen as it is one of the most populous states in the United States, with a total population of 39.5 million people. California also has one of the highest rates of car accidents in the US [7]. The resulting code and analysis can then be easily adapted and replicated for other states, and even other countries, if similar data are available.
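# Scrape the TMC event code list from the OpenStreetMap wiki
url <- "https://wiki.openstreetmap.org/wiki/TMC/Event_Code_List"
urldata <- getURL(url)
tmc.tables <- readHTMLTable(urldata)
tmc.table <- tmc.tables[[1]]
# Keep the code and description columns, then drop the first row
tmc <- as.data.frame(tmc.table[,1:2],stringsAsFactors=FALSE)
tmc <- tmc[-1,]
# Join on the code as text so that both key columns share the same type
california$TMC <- as.character(california$TMC)
tmc$V1 <- as.character(tmc$V1)
newcalifornia <- left_join(california,tmc,by=c("TMC"="V1"))
# Rename the joined description column (column 50)
colnames(newcalifornia)[50] <- "TMC Code"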
In the original data, we noticed a column named TMC, short for ‘Traffic Message Channel’, whose code provides more detailed information about the car accident. As the numeric code alone does not give a clear and direct description of the accident, we decided to add an extra column containing the description that corresponds to each TMC code.
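Because a left join keeps every accident record even when its code has no match, it can be worth checking which codes were left without a description. The following is a minimal sketch on toy data, assuming the same `TMC`, `V1` and `V2` column names as above; the record IDs and descriptions are placeholders, not entries from the real event code list.

```r
library(dplyr)

# Toy stand-ins for the accident records and the scraped TMC table.
# The descriptions are placeholders, not real event code list entries.
accidents <- data.frame(ID = c("A-1", "A-2", "A-3"),
                        TMC = c("201", "241", "999"),
                        stringsAsFactors = FALSE)
tmc_lookup <- data.frame(V1 = c("201", "241"),
                         V2 = c("description for 201", "description for 241"),
                         stringsAsFactors = FALSE)

# left_join keeps every accident row; codes missing from the lookup get NA.
joined <- left_join(accidents, tmc_lookup, by = c("TMC" = "V1"))

# Codes that found no description, worth inspecting before analysis.
unmatched <- joined %>% filter(is.na(V2)) %>% distinct(TMC)
print(unmatched)
```

Here the code "999" is deliberately absent from the lookup, so it is the only row reported as unmatched.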
holidays <- read_excel("PublicholidayCA.xlsx")
holidays$Date <- as.Date(holidays$Date)
newcalifornia$date <- date(newcalifornia$Start_Time)
newcalifornia$day <- weekdays(newcalifornia$date)
newcalifornia$weekend <- ifelse(newcalifornia$day=="Saturday" | newcalifornia$day=="Sunday",1,0)
newcalifornia$ph <- ifelse(newcalifornia$date %in% holidays$Date,1,0)
newcalifornia <- newcalifornia %>% mutate(hour = substr(Start_Time,12,13), month = substr(date,6,7), year = substr(date,1,4))
newcalifornia$day <- factor(newcalifornia$day,levels = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"))
newcalifornia$typeDay <- ifelse(newcalifornia$ph == 1, "Public Holiday", ifelse(newcalifornia$weekend==1, "Weekend", "Weekday"))
newcalifornia$typeDay <- factor(newcalifornia$typeDay, levels = c("Weekday", "Weekend", "Public Holiday"))
The dataset contained the start and end time of the accident. From the start time, we extracted information regarding the hour, day, month and year. We further grouped the days into weekdays, weekends and whether it was a public holiday.
w1 <- "Clear"
w2 <- "Cloudy"
w3 <- "Haze/Fog"
w4 <- "Drizzle"
w5 <- "Rain"
w6 <- "Snow"
w7 <- "Hail"
specificweatherlist <- c(w1,w2,w3,w2,w2,w2,w4,NA,w1,w3,w3,w3,w5,w4,w5,w3,w6,w6,w3,w3,w5,w4,w5,w1,
w2,w3,w3,w4,w3,w3,w1,w5,w1,w4,w7,w4,w5,w5,w3,w5,w2,w2,w5,w5,w5,w5,w5,w6,
w6,w2,w4,w5,w1,w3,w3,w1,w6,w6,w5,w3,w3,w5,w3,w5,w6,w1,w3)
weathertypes = unique(newcalifornia$Weather_Condition)
indexes <- match(newcalifornia$Weather_Condition, weathertypes)
newweatherlist <- specificweatherlist[indexes]
newcalifornia <- add_column(newcalifornia, Weather_Type = newweatherlist, .after = "Weather_Condition")
write.csv(newcalifornia,"CA.csv")
In the data, we noticed that there are 70 weather conditions listed, many of which are very similar, so we grouped them broadly into 7 weather types. For example, ‘Mostly Cloudy’, ‘Scattered Clouds’ and ‘Partly Cloudy’ are grouped into one category, “Cloudy”. This reduces the number of categories used in the plots during our analysis. We then exported the data as a new and smaller csv file to work with.
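The mapping above relies on the order in which `unique()` returns the weather conditions. As an alternative, the same grouping could be sketched with a named lookup vector keyed on the condition text itself, which does not depend on row order. This is only an illustration: apart from the three cloudy labels mentioned in the text, the raw labels below are assumed examples, and the name `weather_lookup` is hypothetical.

```r
# Named vector: names are raw Weather_Condition values, values are the broad type.
weather_lookup <- c(
  "Mostly Cloudy"    = "Cloudy",
  "Scattered Clouds" = "Cloudy",
  "Partly Cloudy"    = "Cloudy",
  "Clear"            = "Clear",  # assumed raw label
  "Light Rain"       = "Rain"    # assumed raw label
)

# "Volcanic Ash" is deliberately left out of the lookup to show the fallback.
conditions <- c("Partly Cloudy", "Clear", "Light Rain", "Volcanic Ash", NA)

# Indexing by name returns NA for missing values and for unlisted conditions.
weather_type <- unname(weather_lookup[conditions])
print(weather_type)
```

Conditions that are not listed come back as `NA`, which makes gaps in the mapping easy to spot.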
The purpose of the app is to provide a clear and user-friendly overview of the accident dataset. The data is consolidated into 3 different tabs. There is also a drop-down bar at the side to allow users to filter the data based on different years and counties.
The first tab shows an overview of various information regarding the State. From this page, the user can compare and identify counties with the highest number of accidents, accident rate per population, road condition and severity, etc.
The second tab shows the general demographic of a selected county. It also further shows the accident count for each city in the county.
The third tab allows users to get in-depth visualization of the accidents in a particular county for a particular year. The map allows users to identify hotspots for accidents while the various graphs show the trend of the accidents across different variables.
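The layout described above, a sidebar with year and county filters next to three tabs, could be sketched in Shiny roughly as follows. This is a minimal skeleton under stated assumptions, not the app's actual code: the input and output IDs, tab titles and placeholder outputs are all hypothetical.

```r
library(shiny)

# Placeholder choices; a real app would derive these from the cleaned data.
years <- 2016:2019
counties <- c("Los Angeles", "Alpine", "San Francisco")

ui <- fluidPage(
  titlePanel("California traffic accidents"),
  sidebarLayout(
    sidebarPanel(
      selectInput("year", "Year", choices = years),
      selectInput("county", "County", choices = counties)
    ),
    mainPanel(
      tabsetPanel(
        tabPanel("State overview", plotOutput("overview_plot")),
        tabPanel("County demographics", tableOutput("county_table")),
        tabPanel("County accidents", plotOutput("county_plot"))
      )
    )
  )
)

server <- function(input, output, session) {
  # Each output would filter the cleaned data by input$year and input$county.
  output$overview_plot <- renderPlot(plot.new())
  output$county_table <- renderTable(
    data.frame(County = input$county, Year = input$year)
  )
  output$county_plot <- renderPlot(plot.new())
}

# Build the app object; call runApp(app) interactively to launch it.
app <- shinyApp(ui, server)
```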
The distribution map allows us to see the accident count across each county at one glance. The graph allows us to compare the accident count and rates across different counties. Using a relative rate instead of an absolute number of accidents can lead to less misleading conclusions. Using the second graph with a log scale, we can have a clearer picture of the accident count across the counties.
Most notably, we observed that Los Angeles (LA) has the highest accident frequency. This result is expected, as LA is the most populous county in California and experiences dense traffic, with most people commuting via private vehicles. However, it is interesting to note that despite LA having the highest number of accidents, Alpine actually has the highest accident rate at 55 accidents per 1,000 people, about 2.4 times that of LA.
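The rate used for this comparison is simply the accident count divided by the population, scaled to 1,000 people. A minimal sketch of the arithmetic is shown below; the function name `rate_per_1000` and its example inputs are hypothetical, while the two rates in the final line are the ones quoted in this report for Alpine (55) and Los Angeles (23).

```r
# Accidents per 1,000 residents.
rate_per_1000 <- function(accidents, population) {
  accidents / population * 1000
}

# Hypothetical inputs, purely to show the calculation.
example_rate <- rate_per_1000(accidents = 50, population = 2000)
print(example_rate)  # 25

# Ratio of the two rates quoted in this report (Alpine 55, Los Angeles 23).
rate_ratio <- round(55 / 23, 1)
print(rate_ratio)    # 2.4
```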
The data set contained information regarding the levels of severity of the accident, with 1 being least severe and 4 being the most severe impact on traffic. A pie chart was used to give an overview into the distributions of severity level. A large proportion of accidents was observed to be of severity level 2 followed by level 3.
The graph of road conditions showed that the majority of accidents did not occur near a bump, give way sign, stop sign, traffic calming marking or even traffic signals. This trend persisted throughout the four years and could be a sign that little has been done to improve the road safety.
A possible explanation for the lack of such safety measures could be due to the fact that most accidents occur on freeways, where safety measures like bumps may not be feasible. Nevertheless, there is still a significant number of accidents occurring outside freeways, where the presence of such safety measures could have possibly reduced the number of accidents.
On the freeways, authorities could look into adjusting or lowering the speed limit based on the road structure, such as at bends and exits. Outside of freeways, more safety measures such as bumps can be implemented, as they are likely to be more effective in reducing the speed of drivers.
In addition, authorities could look into more data like victim information to discover some trends at certain locations. For example, a high rate of vehicle collisions with cyclists at particular streets may suggest a need for better mitigation policies, such as building designated bike lanes [8].
An overview table of the county’s demographics and vehicle proportion is displayed to enable a comprehensive view and easier comparison between various counties, relative to graphical representations. Counties with interesting demographics such as high proportions of vulnerable groups (children and elderly) can be identified and analysed to pick out possible causation for accident rates.
Different counties are also ranked based on accident rates for easy identification of counties most/least at risk to facilitate further investigation.
In this section, the functionality and purposefulness of the app will be demonstrated through the analysis of several counties. There were significant findings in each county which could be helpful in assisting the authorities to craft relevant mitigation measures.
Across all the counties, we observed that most accidents occurred on the expressways, especially at bends or exits. This is reasonable as cars are allowed to travel at much higher speeds there: under Californian law, the speed limit is 65 mph on expressways and 55 mph on two-lane undivided highways (unless otherwise stated) [9]. Given that expressways are designed for high-speed vehicles, there tend to be fewer road safety amenities such as bumps and traffic signals to control traffic flow, which might translate to higher accident rates. This further supports our observation above that traffic regulatory infrastructure was absent in the majority of the accidents. It may indicate that there is considerable room for improvement in highway road safety, perhaps through better traffic education, changes to traffic law or even redesigning the road structure where possible.
However, we understand that it is difficult to come up with a one-size-fits-all approach to tackle the issue of high accident rates on highways. Therefore, our app provides features to help identify hotspot areas, time periods and other relevant factors for each county which could contribute to accident rates. The app enables authorities to better understand county characteristics and conditions and tailor solutions to better cater to each specific county.
Los Angeles has the highest total population at 10,105,518 and total vehicle count at 8,154,560 registered vehicles. It has an accident rate of 23 for every 1,000 people and a vehicle accident rate of 28 accidents per 1,000 vehicles placing it at 6th in terms of accident rate amongst all the counties in California.
From the maps, we observed that the freeways are particularly prone to accidents as seen from the concentration of orange and red. These accidents along the freeways have severity levels ranging from 2 to 4. This observation is consistent with that of a vast majority of the counties, allowing us to draw the conclusion that there tends to be high rates of accidents along highways as stated above in our general findings.
From the plot above, we can see that a fair number of the lower-severity accidents happen on normal roads. This signals to the relevant authorities that they may have overlooked certain contributing factors. One possible factor is California’s 85th percentile state law, under which the speed limit is set according to the “prevailing speed” [10], [11]. This “prevailing speed” is determined from speed surveys which calculate the 85th percentile of car speeds; once this speed is derived, 5 mph is subtracted to ensure safety. The law assumes that uniformity of speed enhances safety and greatly reduces the risk of vehicles colliding with each other.
With many accidents happening on normal roads as seen above, such speed limit laws may not be adequate, as they fail to consider other factors such as road structure and design which could make driving at a given speed dangerous. Also, consistent with the general findings, the lack of safety measures like bumps, give way signs and traffic calming could have contributed to the high number of accidents, especially on small roads. On such roads the 85th percentile law could be dangerous, as it may unintentionally encourage drivers to drive faster and faster.
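To make the “prevailing speed” rule concrete, the calculation as described in this report (85th percentile of surveyed speeds, minus 5 mph) can be sketched as below. The surveyed speeds are hypothetical, and the sketch follows only the description given here rather than the exact statutory procedure.

```r
# Hypothetical speeds (mph) recorded in a speed survey.
speeds <- c(30, 32, 35, 36, 38, 40, 41, 42, 45, 50, 52)

# 85th percentile of the surveyed speeds (R's default quantile method).
prevailing <- quantile(speeds, probs = 0.85, names = FALSE)

# Subtract 5 mph, as described in the text above.
limit <- prevailing - 5
print(c(prevailing = prevailing, limit = limit))
```

Note that the resulting limit depends entirely on how fast drivers already travel, which is why faster traffic can lead to a higher limit.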
We can see that the accident frequency is rather high throughout the day, from 6am to about 8pm, with the peak occurring in the evening. As Los Angeles is a major city, these accidents reflect the peak hours of a typical city, when commuters travel to and from work. This is reinforced by the fact that more accidents occur on weekdays, when people are working, than on weekends or public holidays.
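A breakdown like this can be obtained by counting records per hour and type of day. The sketch below uses a toy data frame with the `hour` and `typeDay` columns created during data cleaning; the values are made up for illustration and the name `hourly` is hypothetical.

```r
library(dplyr)

# Toy records using the column names from the cleaning step; values are made up.
toy <- data.frame(
  hour    = c("08", "08", "17", "17", "17", "13"),
  typeDay = c("Weekday", "Weekday", "Weekday", "Weekend", "Weekday",
              "Public Holiday"),
  stringsAsFactors = FALSE
)

# Number of accidents for each combination of day type and hour,
# with the busiest combinations listed first.
hourly <- toy %>%
  count(typeDay, hour, name = "accidents") %>%
  arrange(desc(accidents))
print(hourly)
```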
The accident rate rises towards the end of the year, from September to December. This could be attributed to slippery roads due to morning frost or wet fallen leaves [12]. Additionally, the number of accidents could be amplified by foggy conditions, which lower visibility and cause headlight glare, and by increased traffic as schools reopen.
As a major city, there is bound to be higher traffic demand. Using the app, the traffic department can have a clearer idea of when to deploy more personnel on-site to monitor traffic conditions and catch drivers who may be driving recklessly.
In addition, the high number of registered vehicles and heavy traffic within LA may be contributing to the higher frequency of accidents. Regulatory measures could be adopted to reduce the number of cars on the road, such as dynamic pricing for road tolls, to manage the flow of traffic and potentially reduce accidents.
We also found that $9 billion has been invested in new light rail and subway lines, yet the number of people commuting via public transport has declined rapidly, and ridership is even lower than three decades ago, when buses were the only option for public transportation [13]. Awareness could be low despite the large investments made, so the authorities could actively take measures to encourage more people to turn to public transport. This can be done through campaigns or incentives in an attempt to reduce the amount of traffic. Alternatively, they could revise the public transport network to increase accessibility, if that is what inhibits people from switching to public transport.
As mentioned earlier, the 85th percentile state law could be a likely reason for more accidents happening on normal roads. Policymakers could therefore review the effectiveness of this law for roads with a higher accident frequency, to ensure that the speed limit is set with consideration of other factors such as road design and road conditions like bends and junctions. However, regulating the speed limit alone may not be sufficient, since some drivers may flout the limit and continue to drive at high speed. Physical safety measures like bumps are likely to be more effective in reducing the speed of such drivers.
Alpine has the smallest population size with a population of only 1,101 and 3,264 registered vehicles yet it has the highest accident rate of 55 per 1,000 people over the 4 years.
From the severity map above, we can see that Alpine seems to be a county with many forests and consistent with the general findings, most of the accidents have a severity of level 2 and occur along the highway.
Looking at the monthly distribution of accidents, a very high accident count was observed in December compared to the rest of the year. This could be due to slippery roads during the winter season. As mentioned earlier, Alpine is a heavily forested area with several nature parks. This makes it a potentially popular tourist destination, which could lead to a high influx of tourists in December, a peak travel season. This influx of tourists could be an underlying reason for the high accident rate of 55 accidents per 1,000 people despite the small local population.
Moreover, more accidents occur during weekends and public holidays, which further supports our view that Alpine is a tourist hotspot, as more tourists may arrive during their free time. These tourists might not be very familiar with the road structure in Alpine. Coupled with the lack of safety measures on the roads of Alpine, high accident rates may persist if the authorities do not actively take measures to prevent accidents.
When we look at the accidents over the past 4 years, the accident count was maintained at less than 10 from 2016 to 2018 but spiked to 35 in 2019. This calls for concern and the underlying reasons should be uncovered as soon as possible to reduce further accidents.
As the accidents usually happen along 2 major stretches of road, the local police may consider measures that target these hotspots. Possible measures include (1) installing speed cameras at the common accident sites, (2) having more frequent highway patrols, especially during the peak months of September and December, (3) setting lower speed limits and (4) reviewing the road structure, which might be an issue given the forested terrain.
San Francisco has a population of 883,305 and 492,336 registered vehicles. It has an accident rate of 10 per 1,000 people and is ranked 41 out of 58 in terms of its accident rate across the 4 years.
We can see that many of the accidents occur around March to June and towards the end of the year, from August to December. This may be attributed to the fact that San Francisco is surrounded by the sea, making it especially prone to cloudy and foggy weather. Hence, poor visibility as a result of cloudy weather might have resulted in a higher frequency of accidents for a majority of the months.
Furthermore, the graph above shows that the majority of accidents occurred during the early morning, from 7am to 9am, and during the evening. Poor lighting and bad weather conditions such as cloud and fog would further exacerbate the problem and could account for the majority of the accidents.
We also noticed that a huge number of accidents happen on the San Francisco-Oakland Bay Bridge, especially on the Alameda side. This could be due to poor design of the road, resulting in high accident rates.
In this particular situation, a high accident rate during cloudy days may indicate that more street lamps or better lighting needs to be put in place, especially in the most affected areas.
In addition, a high rate of accidents during cloudy weather could suggest that drivers need to develop better driving habits due to lower visibility. Bad habits such as tailgating can result in severe chain collisions. Hence, the traffic department should possibly start a campaign to educate the people to practice safe driving habits, such as a larger safe trailing distance and turning on their headlights, especially on cloudy days.
With regards to the San Francisco Oakland Bay Bridge, more detailed conditions of the accidents could be looked into by the traffic department so that they could consider changing the road layout or building another bridge to distribute the traffic going to and from Alameda.
The data from Kaggle provided us with crucial information on car accidents such as the location of accidents (longitude and latitude), severity levels, road conditions and more. However, we felt that our app and analysis could be more complete if more information were provided, such as (1) cause of the accidents, (2) profile of the victims and damages, (3) number and type of vehicles involved, etc. As mentioned previously, we had to source some external data to aid our analysis and provide more comprehensive information to the government. However, as these datasets were not directly linked to the accidents that occurred, they were limited in the insights they could provide.
In addition to population data, demographic information on the victims would allow us to draw more insights into the relationship between the demographics of victims and certain hotspots and variables. The authorities would then be able to get a better understanding of accidents in each county and consider how they could better craft their policies. For example, if a county has a high number of accidents involving teens, it could work on improving its traffic education system. If a high number of accidents affects the elderly, it could focus its efforts on better facilities for the elderly. Without such data, we decided to focus on the demographics of each county and identify counties with a high proportion of vulnerable groups such as the elderly and children. This allows the government to review the existing safety measures and assess whether they are adequate or need to be enhanced.
It would also be more insightful if we were able to obtain information on the vehicle types involved in these accidents. This could allow us to draw on other variables for more insightful findings. For instance, we are currently able to identify certain areas that are more prone to accidents due to icy roads. Certain vehicles, such as trucks, might be more affected than others due to their size, which would push the authorities to craft better roads and policies with regard to heavy-duty vehicles.
Furthermore, we were unable to obtain information on the type of accident, that is, whether it was a vehicle-to-vehicle or vehicle-to-pedestrian collision. Having a brief description of the accident would allow us to highlight certain persisting trends. If the frequency of vehicle-to-vehicle collisions is high, the government could review the existing road structures and signs to reduce accident rates. If the frequency of vehicle-to-pedestrian collisions is high, it could be attributed to the lack of pedestrian crossing amenities, and the government could look into possible improvements in this aspect.
The data also did not include causes of these accidents, such as drunk driving, speeding, or running the red light. This would help us identify if traffic accidents are caused by inadequate road infrastructure and safety policies or reckless driving behaviour on the drivers’ part. If the accidents are due to reckless driving behaviour on the motorists’ part, the government could look into crafting more stringent laws to inhibit such irresponsible behaviours, provide more education to raise their safety awareness or to step up their deployment of traffic police during specific peak hours or peak months.
The application that we have built can be easily extended to the rest of the United States by using the initial dataset. Other required information like population and vehicle information can be simply obtained from the web.
Furthermore, the app is designed to be able to run the same analysis and visualisation for other countries beyond the United States as long as similar data is used. For example, the data on UK traffic accidents found on Kaggle at https://www.kaggle.com/daveianhickey/2000-16-traffic-flow-england-scotland-wales can be easily adopted for use in our application. Hence, we believe that if we have more comprehensive traffic data from different countries, we would be able to analyse them to bring more insights.
Our app has the potential to be also used in the local context, Singapore. However, more comprehensive traffic data for accidents are currently held by the relevant authorities and not disclosed to the public. If given the opportunity, we can further leverage other current data such as ERP data and traffic data to make use of the current traffic conditions to further improve our app. Even if the Singapore authorities have a system to analyse accidents, such an app would be useful to the private sector, such as private hire companies, delivery companies and online maps, to be able to further improve the efficiency and safety of their drivers by optimizing routes that are safer or less congested.
Through simple data manipulation, visualisation and analysis, we were able to turn a large amount of raw data on US traffic accidents, together with other relevant information found online, into much more concise, insightful and meaningful information. Additionally, by using a Shiny app, we were able to present the findings in a straightforward, structured and accessible manner. This allows users to effortlessly retrieve insights, aiding informed decision-making.
[1] “2018 Was the Year of the Car, and Transit Ridership Felt It,” Government Technology State & Local Articles - e.Republic. [Online]. Available: https://www.govtech.com/fs/transportation/2018-Was-the-Year-of-the-Car-and-Transit-Ridership-Felt-It.html. [Accessed: 19-Apr-2020].
[2] F. Richter, “Infographic: Cars Still Dominate the American Commute,” Statista Infographics, 29-May-2019. [Online]. Available: https://www.statista.com/chart/18208/means-of-transportation-used-by-us-commuters/. [Accessed: 19-Apr-2020].
[3] “Road Safety Facts,” Association for Safe International Road Travel. [Online]. Available: https://www.asirt.org/safe-travel/road-safety-facts/. [Accessed: 19-Apr-2020].
[4] L. Evans, “Traffic Fatality Reductions: United States Compared With 25 Other Countries,” American Journal of Public Health, vol. 104, no. 8, pp. 1501–1507, 2014.
[5] S. Fleming, “Traffic congestion cost the US economy nearly $87 billion in 2018,” World Economic Forum. [Online]. Available: https://www.weforum.org/agenda/2019/03/traffic-congestion-cost-the-us-economy-nearly-87-billion-in-2018/. [Accessed: 19-Apr-2020].
[6] “Road traffic injuries,” World Health Organization. [Online]. Available: https://www.who.int/news-room/fact-sheets/detail/road-traffic-injuries. [Accessed: 19-Apr-2020].
[7] “California car accident statistics,” Winer, Burritt & Tillis LLP, 29-Mar-2019. [Online]. Available: https://www.wmlawyers.com/2017/02/california-car-accident-statistics/. [Accessed: 19-Apr-2020].
[8] J. Linton, A. Schmitt, and M. Curry, “L.A. Times Editorial Urges Reform To CA’s ‘Absurd Dangerous Counterproductive’ Speed Law,” Streetsblog Los Angeles, 19-Feb-2020. [Online]. Available: https://la.streetsblog.org/2020/02/19/l-a-times-editorial-urges-reform-to-cas-absurd-dangerous-counterproductive-speed-law/. [Accessed: 19-Apr-2020].
[9] California Department of Motor Vehicles, “California Driver Handbook - Laws and Rules of the Road,” California Driver Handbook - Laws/Rules of the Road. [Online]. Available: https://www.dmv.ca.gov/portal/dmv/detail/pubs/hdbk/speed_limits. [Accessed: 19-Apr-2020].
[10] A. Said, D. Newton, M. Curry, J. Linton, and A. Schmitt, “Legal Obstacles To Safe Streets: California Speed Limit Laws,” Streetsblog Los Angeles, 15-Jun-2016. [Online]. Available: https://la.streetsblog.org/2016/06/15/legal-obstacles-to-safe-streets-california-speed-limit-laws/. [Accessed: 19-Apr-2020].
[11] J. Linton, A. Schmitt, and M. Curry, “L.A. Times Editorial Urges Reform To CA’s ‘Absurd Dangerous Counterproductive’ Speed Law,” Streetsblog Los Angeles, 19-Feb-2020. [Online]. Available: https://la.streetsblog.org/2020/02/19/l-a-times-editorial-urges-reform-to-cas-absurd-dangerous-counterproductive-speed-law/. [Accessed: 19-Apr-2020].
[12] “Beginning of autumn brings new driving hazards: Michigan Personal Injury Attorneys,” Bredell & Bredell. [Online]. Available: https://www.bredell.com/articles/beginning-of-autumn-brings-new-driving-hazards/. [Accessed: 19-Apr-2020].
[13] “Billions spent, but fewer people are using public transportation in Southern California,” Los Angeles Times, 27-Jan-2016. [Online]. Available: https://www.latimes.com/local/california/la-me-ridership-slump-20160127-story.html. [Accessed: 19-Apr-2020].